{
 "metadata": {
  "name": "",
  "signature": "sha256:d7031dbc27d4012963cb7521522680f9922ffa100ae73e625bb01da2250e5180"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Large Scale Structure in Social Networks"
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Assortativity, Modularity, and Community Structure"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A network is a complicated system with both small-scale structure (\"what is the average degree of a node?\", \"how many nodes are involved in triangles?\") that generally depends only on individual nodes and thier neighbors, and large-scale structure (\"what is the diameter of the graph?\", \"how are sets of nodes connected?\"). \n",
      "\n",
      "Quantifying network structure, and deciding whether or not that structure is meaningful, depends strongly on a notion of *average* network structure. In other words, we can understand structural measurements on a network by deciding how much they deviate from an *average* network, or an appropriately selected *null model*.\n",
      "\n",
      "The following measurements of network structure all try to determine how much more connected certain nodes are than average. In order to define an average, all of these measurements use a null model known as the *configuration model*. The configuration model maintains the degree distribution of a network $\\vec{k}$, but rearranges its edges. For any given node, we are essentially asking the question: \"are the *number* of connections of this node special, or the *types* of other nodes that it is connected to?\"\n"
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Some terms and common symbols:"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "- **adjacency matrix**: A (square) matrix $A$ with one row/col per node in the network. An entry of 1 at the location $(i,j)$ indicates an edge from node $i$ to node $j$\n",
      "- **degree**: $k_i$, the number of edges connected to a given node $i$\n",
      "- **in-degree**: $k^{in}_i$, the number of directed edges pointing to a given node in a directed graph\n",
      "- **out-degree**: $k^{out}_i$, the number of directed edges pointing out of a given node in a directed graph\n",
      "- **degree distribution**: $\\vec{k}$, the distribution of all of the degree values of nodes in the graph\n",
      "- **simple graph**: A graph without *multi-edges* (only one edge is possible between two nodes) and *self-loops* (a node cannot link to itself)\n",
      "- **total edges**: $m$, the total number of edges in the graph "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Import some libraries\n",
      "from __future__ import division\n",
      "import igraph \n",
      "import numpy as np\n",
      "from random import shuffle, random\n",
      "import matplotlib.pyplot as plt\n",
      "from IPython.display import Image\n",
      "from tabulate import tabulate\n",
      "%matplotlib inline"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# get some graphs to use as example graphs\n",
      "color_dict = {1: \"blue\", 2: \"red\", 3: \"green\", 4: \"yellow\", 5: \"purple\", 6: \"pink\"}\n",
      "shape_dict = {1: \"rectangle\", 2: \"circle\", 3: \"triangle-up\", 4: \"triangle-down\"}\n",
      "# Zachary karate club graph\n",
      "karate = igraph.Nexus.get(\"karate\")\n",
      "karate_adjmat = np.matrix([x for x in karate.get_adjacency()])\n",
      "karate_labels = [int(x) for x in karate.vs[\"Faction\"]]\n",
      "karate.vs[\"color\"] = [color_dict[x] for x in karate_labels]\n",
      "karate.vs[\"shape\"] = [shape_dict[x] for x in karate_labels]\n",
      "karate.vs[\"label\"] = karate.vs[\"name\"]\n",
      "print karate.summary()\n",
      "print igraph.Nexus.info(\"karate\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "igraph.plot(karate)\n",
      "# Orrr, if you're lame and you couldn't get Cairo to work on your machine :P, uncomment\n",
      "#Image(filename=\"karate_club_graph.png\", width=600)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# UK faculty graph (interactions between faculties at UK universities)\n",
      "UKfaculty = igraph.Nexus.get(\"UKfaculty\")\n",
      "UKfaculty_adjmat = np.matrix([x for x in UKfaculty.get_adjacency()])\n",
      "UKfaculty_labels = [int(x) for x in UKfaculty.vs[\"Group\"]]\n",
      "UKfaculty.vs[\"color\"] = [color_dict[x] for x in UKfaculty_labels]\n",
      "UKfaculty.vs[\"shape\"] = [shape_dict[x] for x in UKfaculty_labels]\n",
      "print UKfaculty.summary()\n",
      "print igraph.Nexus.info(\"UKfaculty\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "igraph.plot(UKfaculty)\n",
      "# Orrr, if you're lame and you couldn't get Cairo to work on your machine :P, uncomment\n",
      "#Image(filename=\"UKfaculty_graph.png\", width=600)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Configuration Model"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A refresher from networks-201. \n",
      "\n",
      "Last time we talked about null models as a concept for graphs and a base point for comparing graph structure. We talked briefly about the *configuration model*, which is a random graph model that has exactly the same degree distribution, but a different structure, than some original graph.  Basically, if a graph G has 10 nodes with degrees $\\vec{k} = [1,5,2,9,3,6,5,4,1,8]$, then some configuration model graph, G_random would (ideally) have the exact same $\\vec{k}$, but the nodes are connected to each other differently (have different neighbors). In reality the algorithm that we use to generate a configuration model of the graph sometimes generates self-loops and multi-edges, which are deleted, leaving the generated graph with slightly fewer edges than the original. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def config_model(graph):\n",
      "    # build a random graph based on the configuration model\n",
      "    C = graph.copy()\n",
      "    C.delete_edges(None)\n",
      "    # graph with the same edge list as G\n",
      "    # Add random edges\n",
      "    if graph.is_directed():\n",
      "        # vertex list A\n",
      "        A_in = []\n",
      "        A_out = []\n",
      "        for v in graph.vs().indices:\n",
      "            for x in range(0, graph.indegree(v)):\n",
      "                A_in.append(v)\n",
      "            for x in range(0, graph.outdegree(v)):\n",
      "                A_out.append(v)\n",
      "        shuffle(A_in)\n",
      "        shuffle(A_out)\n",
      "        # make the edge list\n",
      "        _E = [(A_out[x], A_in[x]) for x in range(0, len(A_in))]\n",
      "    else:\n",
      "        # vertex list A\n",
      "        A = []\n",
      "        for v in graph.vs().indices:\n",
      "            for x in range(0, graph.degree(v)):\n",
      "                A.append(v)\n",
      "        shuffle(A)\n",
      "        # make the edge list\n",
      "        _E = [(A[2*x], A[2*x+1]) for x in range(0,int(len(A)/2))]\n",
      "    E = set([x for x in _E if x[0]!=x[1]])\n",
      "    # add the edges to C\n",
      "    C.add_edges(E)\n",
      "    \n",
      "    # return the configuration model generated graph\n",
      "    return C"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For any given network, if we created a configuration model based on that network, the expected number of edges between nodes $a$ and $b$ would be $\\frac{k_ak_b}{2m}$ where $k_a$ and $k_b$ are the degrees of nodes $a$ and $b$ respectively, and $m$ is the number of edges in the network. \n",
      "\n",
      "To justify that claim, consider the following. A node $a$ with degree $k_a$ has $k_a$ edges attached to it. Each of those edges has 2 ends, one attached to node $a$, and the other attached (randomly) to some other node in the network. In the configuration model, each node has a fixed degree, so the probablity of a given edge attaching to a node $b$ with degree $k_b$ is $\\frac{k_b}{2m}$ where $k_b$ is the number of ends of edges attached to $b$, and $2m$ is the number of ends of edges in the network (each edge $m$ has two ends). Then the expected number of connections between two vertices (which can usually be interpreted as the probability of a connection between two vertices) is:\n",
      "$$E[\\text{edge}] = k_a\\frac{k_b}{2m}$$\n",
      "\n",
      "Interpreting this quantity as a probability for a simple graph (a graph without multi-edges and self-loops) assumes that the graph is sparse (the number of edges is much smaller than the number of possible edges) and that the edges are not overly concentrated in a few nodes. For many social graphs, these are fine assumptions. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Calculate the expected number of edges between each pair of nodes, using the configuration model\n",
      "def expected_edges(adjmat):\n",
      "    (i, j) = adjmat.shape\n",
      "    m = np.sum(adjmat)/2\n",
      "    k_mat = np.sum(adjmat,0)\n",
      "    k = [k_mat[0,x] for x in range(0,i)]\n",
      "    q = np.zeros((i,j))\n",
      "    for row in xrange(0,i):\n",
      "        for col in xrange(0,j):\n",
      "            if (row != col):\n",
      "                q[row,col] = k[row]*k[col]/(2*m)\n",
      "    return q\n",
      "\n",
      "# calculate the fraction of times two nodes are connected in the null model\n",
      "def frac_of_times_connected(graph, iterations = 1000):\n",
      "    config_adjmat = np.zeros((len(graph.vs), len(graph.vs)))\n",
      "    for i in range(0, iterations):\n",
      "        config = config_model(graph)\n",
      "        config_adjmat += np.matrix([x for x in config.get_adjacency()])\n",
      "    frac_connected = np.divide(config_adjmat, float(iterations))\n",
      "    return frac_connected\n",
      "\n",
      "fig, axes = plt.subplots(1,3, figsize=(15,5))\n",
      "print \"Axes are labeled with vertex #\"\n",
      "axes[0].set_title(\"Expected # edges\")\n",
      "axes[0].matshow(expected_edges(karate_adjmat), cmap = 'gist_heat_r')\n",
      "axes[1].set_title(\"Fraction connected in the model\")\n",
      "axes[1].matshow(frac_of_times_connected(karate, iterations = 1000), cmap = 'gist_heat_r')\n",
      "axes[2].set_title(\"Edges in the actual graph\")\n",
      "axes[2].matshow(karate_adjmat, cmap = 'gist_heat_r')\n",
      "plt.show()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Notice how vertices 0 and 34 *aren't* connected in the real graph (0 and 34 are John A and Mr Hi respectively, the president and the instructor who led the splitting factions), but they are *almost always* connected in the configuration model of the karate graph. That is because they both have a very high degree, and, in edges were assigned at random, would likely connect with eachother. In case you were curious how the names correlate with factions:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print tabulate({\"Mr Hi (Actor 0), Faction 1\": [x[\"name\"] for x in karate.vs if x[\"Faction\"] == 1.0], \n",
      "                \"John A (Actor 34), Faction 2\": [x[\"name\"] for x in karate.vs if x[\"Faction\"] == 2.0]}, headers = \"keys\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Modularity"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "To what extent does like connect to like? \n",
      "\n",
      "Reference: *Networks* Ch 7 S13, *Homophily and Assortative Mixing*\n",
      "\n",
      "Modularity, $Q$, measures the fraction of edges connecting like (or unlike) nodes in a network. To do this, we take the difference between that actual fraction of edges connecting nodes fo the same type,  Starting with $E[edge] = k_a\\frac{k_b}{2m}$ for any given nodes $a$ and $b$, \n",
      "we can then say that:\n",
      "$$E[\\text{edges connecting nodes of the same type}] = \\frac{1}{2}\\sum_{ij}\\frac{k_ik_j}{2m}\\delta(c_i, c_j) = \\text{(for directed graphs)} \\frac{1}{2}\\sum_{ij}\\frac{k_i^{in}k_j^{out}}{2m}\\delta(c_i, c_j)$$\n",
      "Where $c_i$ denotes the \"type\" of node $i$ and $$ \n",
      "   \\delta(c_i, c_j) = \\left\\{\n",
      "     \\begin{array}{lr}\n",
      "       1 & : c_i = c_j\\\\\n",
      "       0 & : c_i \\neq c_{j}\n",
      "     \\end{array}\n",
      "   \\right.\n",
      " $$\n",
      " \n",
      "We can also say that the actual number of edges connecting nodes of like types is:\n",
      "$$\\text{Edges connecting like nodes} = \\frac{1}{2}\\sum_{ij}A_{ij}\\delta(c_i, c_j)$$\n",
      "($A$ is the adjacency matrix, where $A_{ij} = 1$ if there is an edge between node $i$ and node $j$, and 0 else.\n",
      "\n",
      "Then the difference between the actual number of in-group edges and the expected number, normalized by the number of edges in the network is given by:\n",
      "\n",
      "$$Q = \\frac{1}{2m}\\sum_{ij}\\left(A_{ij} - \\frac{k_ik_j}{2m}\\right)\\delta(c_i, c_j)$$\n",
      "\n",
      "We call $Q$ modularity."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def modularity(adjmat, labels):\n",
      "    (i, j) = adjmat.shape\n",
      "    m = np.sum(adjmat)/2\n",
      "    kin_mat = np.sum(adjmat,0)\n",
      "    kout_mat = np.sum(adjmat,1)\n",
      "    kin = [kin_mat[0,x] for x in range(0,i)]\n",
      "    kout = [kout_mat[x,0] for x in range(0,i)]\n",
      "    q = 0\n",
      "    for row in xrange(0,i):\n",
      "        for col in xrange(0,j):\n",
      "            if (labels[row] == labels[col]):\n",
      "                q += (adjmat[row, col] - kin[row]*kout[col]/(2*m))\n",
      "    Q = q/(2*m)\n",
      "    return Q"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Assortativity"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If we have labels on the nodes in the network (i.e., \"users whose primary language is X\"), are users with the same label more connected than average? If we have scalar attributes on the network (i.e., \"a user is X years old\"), are users with similar attributes more connected than average?\n",
      "\n",
      "Reference: *Mixing Patterns in Networks* by MEJ Newman"
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Assortative mixing by discrete node characteristics (i.e., language spoken)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Discrete assortative mixing is exactly a normalized modularity measurement on node attributes.\n",
      "\n",
      "We can normalize $Q$ by dividing it by its maximal value on a graph (that is, the value that it would take if all nodes were only conencted to nodes with the same labels).\n",
      "\n",
      "$$r = \\frac{Q}{Q_{max}} = \\frac{\\sum_{ij}\\left(A_{ij} - k_ik_j/2m\\right)\\delta(c_i, c_j)}{\\sum_{ij}\\left(1 - k_ik_j/2m\\right)\\delta(c_i, c_j)} = \\text{(for directed graphs) } \\frac{\\sum_{ij}\\left(A_{ij} - k_i^{in}k_j^{out}/2m\\right)\\delta(c_i, c_j)}{\\sum_{ij}\\left(1 - k_i^{in}k_j^{out}/2m\\right)\\delta(c_i, c_j)} $$\n",
      "\n",
      "\n",
      "\n",
      "Now $r$ falls between -1 and 1 for all networks, where a value of $r=1$ is a perfectly assortative network (like connects with like) and $r=-1$ is a perfectly dissassortative network (like connects with dislike)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def normalized_modularity(adjmat, labels):\n",
      "    (i, j) = adjmat.shape\n",
      "    m = np.sum(adjmat)/2\n",
      "    kin_mat = np.sum(adjmat,0)\n",
      "    kout_mat = np.sum(adjmat,1)\n",
      "    kin = [kin_mat[0,x] for x in range(0,i)]\n",
      "    kout = [kout_mat[x,0] for x in range(0,i)]\n",
      "    Q = 0\n",
      "    Qmax = 0\n",
      "    for row in xrange(0,i):\n",
      "        for col in xrange(0,j):\n",
      "            if labels[row] == labels[col]:\n",
      "                d = 1\n",
      "            else:\n",
      "                d = 0\n",
      "            Q += (adjmat[row, col] - kin[row]*kout[col]/(2*m))*d\n",
      "            Qmax += adjmat[row, col] - (kin[row]*kout[col]/(2*m))*d\n",
      "    r = Q/Qmax\n",
      "    return r"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Warning: I'm about to gloss over a little algebra.\n",
      "\n",
      "Define the matrix $\\mathbf{e}$ as a matrix whose entries $\\mathbf{e}_{ij}$ are the fraction of edges in the graph connecting nodes of type $i$ to nodes of type $j$. Call $\\mathbf{e}$ the *mixing matrix* of the graph. \n",
      "\n",
      "Then with some algebra (pg 225 of *Networks* & in *Mixing Patterns of Networks*) we can turn our expression for normalized modularity into this expression: \n",
      "$$ r = \\frac{\\mathbf{Tr}(e) - ||\\mathbf{e}^2||}{1- || \\mathbf{e}^2||} $$"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Calculate the mixing matrix e\n",
      "def mixing_matrix(adjmat, labels):\n",
      "    (i, j) = adjmat.shape\n",
      "    m = np.sum(adjmat)\n",
      "    labels_to_indices = {}\n",
      "    index = 0\n",
      "    for l in labels:\n",
      "        if l not in labels_to_indices.keys():\n",
      "            labels_to_indices[l] = index\n",
      "            index += 1\n",
      "    e = np.zeros((index, index))\n",
      "    for row in xrange(0,i):\n",
      "        for col in xrange(0,j):\n",
      "            e[labels_to_indices[labels[row]], labels_to_indices[labels[col]]] += adjmat[row, col]\n",
      "    e = np.divide(e, m)\n",
      "    return e\n",
      "\n",
      "# Use e to calculate the assortativity\n",
      "def assortativity(adjmat, labels):\n",
      "    e = mixing_matrix(adjmat, labels)\n",
      "    sum_e_squared = np.sum(np.dot(e,e))\n",
      "    r = (e.trace() - sum_e_squared)/(1 - sum_e_squared)\n",
      "    return r"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print \"----------------------------------------------------\"\n",
      "print \"The mixing matrix for the karate club graph:\"\n",
      "print mixing_matrix(karate_adjmat, karate_labels)\n",
      "print \"'Normalized modularity' of the karate club graph: {}\".format(normalized_modularity(karate_adjmat, karate_labels))\n",
      "print \"'Assortativity' of the karate club graph: {}\".format(assortativity(karate_adjmat, karate_labels))\n",
      "\n",
      "print \"----------------------------------------------------\"\n",
      "print \"The mixing matrix for the UKfaculty graph:\"\n",
      "print mixing_matrix(UKfaculty_adjmat, UKfaculty_labels)\n",
      "print \"'Normalized modularity' of the UKfaculty graph: {}\".format(normalized_modularity(UKfaculty_adjmat, UKfaculty_labels))\n",
      "print \"'Assortativity' of the UKfaculty graph: {}\".format(assortativity(UKfaculty_adjmat, UKfaculty_labels))\n",
      "print \"----------------------------------------------------\"\n",
      "print \"Hint: those last two numbers are the same\""
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Assortative mixing by scalar node characteristics (i.e., age)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can also measure the assortativity of a network on scalar characteristics, such as age. In this way, the question is, \"do nodes with more similar characteristics associate more?\" \n",
      "\n",
      "The following is essentially a correlation coefficient calculation on node connectivity over scalar charcteristics $x$.\n",
      "\n",
      "$$r =  \\frac{\\sum_{ij}\\left(A_{ij} - k_i^{in}k_j^{out}/2m\\right)x_i x_j}{\\sum_{ij}\\left(A_{ij}x_i^2 - \\left(k_i^{in}k_j^{out}/2m \\right) x_i x_j\\right)}$$ "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def assortativity_scalar(adjmat, values):\n",
      "    (i, j) = adjmat.shape\n",
      "    m = np.sum(adjmat)/2\n",
      "    kin_mat = np.sum(adjmat,0)\n",
      "    kout_mat = np.sum(adjmat,1)\n",
      "    kin = [kin_mat[0,x] for x in range(0,i)]\n",
      "    kout = [kout_mat[x,0] for x in range(0,i)]\n",
      "    covar = 0\n",
      "    var = 0\n",
      "    for row in xrange(0,i):\n",
      "        for col in xrange(0,j):\n",
      "            covar += (adjmat[row, col] - kin[row]*kout[col]/(2*m))*values[row]*values[col]\n",
      "            var += adjmat[row, col]*values[row]**2 - (kin[row]*kout[col]/(2*m))*values[row]*values[col]\n",
      "    r = covar/var\n",
      "    return r"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "One very common scalar assortativity coefficient is measuring assortativity by degree. There are more clever ways to use the graph structure to compute this, but I'll stick with the above algorithm for now. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "assortativity_scalar(karate_adjmat, karate.degree())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Error measurements on assortativity"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "How can we be sure of the validity of an assortativity measurement?\n",
      "\n",
      "In this model, we assume that every edge has independent probabilty of existing, where the probabilty of an edge between two nodes is related (somehow) to thier individual characteristics (this formulation of a network leads to the idea of the Stochastic Block Model). By assuming every edge in independent, we can ask the question \"how much does this measurement change with a small perturbation in the graph structure?\" Using small perturbation = removing one edge, calculate the expected standard deviation of assortativity as:\n",
      "\n",
      "$$ \\sigma_r^2 = \\sum^M_{i=1}(r_i - r)^2 $$\n",
      "\n",
      "where $r_i$ is the value of $r$ in which the $i$th edge is removed.\n",
      "\n",
      "This is *not* the most efficient way to make an error measurement here (on a graph with more than a hundred or so nodes, it takes ages), but it is the most intuitive."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def assortativity_error(adjmat, labels, directed, assortativity_function):\n",
      "    (i, j) = adjmat.shape\n",
      "    # list of all of the edges as positions in the adjmat\n",
      "    # make sure not to double-count edges if the network is not directed\n",
      "    edges= []\n",
      "    for x in xrange(0,i):\n",
      "        for y in xrange(0,j):\n",
      "            if adjmat[x,y] != 0:\n",
      "                if directed:\n",
      "                    edges.append((x,y))\n",
      "                if not directed:\n",
      "                    edges.append(tuple(sorted([x,y])))\n",
      "    if not directed:\n",
      "        edges = list(set(edges))\n",
      "    # original assortativity r\n",
      "    r = assortativity_function(adjmat, labels)\n",
      "    sigma = 0\n",
      "    # remove one edge at a time, and compute assortativity\n",
      "    for edge in edges:\n",
      "        adjmat[edge] = 0\n",
      "        sigma += (assortativity_function(adjmat, labels) - r)**2\n",
      "        adjmat[edge] = 1\n",
      "    return sigma"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "assortativity_error(UKfaculty_adjmat, UKfaculty_labels, True, assortativity)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Maximizing Modularity"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "How can we find sets of nodes that are more tightly (structurally) associated than average? Once we have defined modularity, how do we use it to determine tightly clustered nodes?\n",
      "\n",
      "Reference: *Modularity and Community Structure in Networks* by MEJ Newman, *Finding Community Structure in Very Large Networks* by Clauset, Newman and Moore\n",
      "\n",
      "The answer: by finding the partitions (labelings) of nodes in the graph that maximizes modularity (minimizes out-group edges, maximizes in-group edges). Often those partitions are called communities in the graph.\n",
      "\n",
      "There are many implementations of algorithms to maximize modularity, but most rely on greedy aglomerative methods of grouping nodes. Known as the CNM algorithm, or a fast-greedy method for modularity maximization, the algortithm essentially does the following:\n",
      "0. Begin by giving each node its own label (communities of one)\n",
      "1. Find the best combination of two communities that *increases* modularity of the network the most\n",
      "2. Give those nodes the same label\n",
      "3. Repeat until all of the nodes have the same label\n",
      "4. Look back at the sequence of node groupings you just performed and chose where to sto (modularity was maximized, or you have the number of communities you want)\n",
      "\n",
      "I won't implemet the algortithm here, I will just show an example from the igraph python library.\n",
      "\n",
      "You can actually find a few more examples of using igraph community-detection algortihms in network-igraph-101"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Make the dendrogram according the fast-greedy aglomerative algorithm\n",
      "karate_dendrogram = karate.community_fastgreedy()\n",
      "# Find the community partitions that maximize modularity\n",
      "karate_communities_opt = karate_dendrogram.as_clustering()\n",
      "i = 0\n",
      "for community in karate_communities_opt:\n",
      "    i += 1\n",
      "    for v in community:\n",
      "        karate.vs[v][\"karate_communities_opt\"] = i\n",
      "# Find the maximal-modularity 2-division (cheating: we are looking for 2 factions because we know this graph)\n",
      "karate_communities_2 = karate_dendrogram.as_clustering(2)\n",
      "i = 0\n",
      "for community in karate_communities_2:\n",
      "    i += 1\n",
      "    for v in community:\n",
      "        karate.vs[v][\"karate_communities_2\"] = i\n",
      "# Show the dendrogram\n",
      "igraph.plot(karate_dendrogram)\n",
      "# Or just open the PNG \n",
      "#Image(filename=\"karate_dendrogram.png\", width=600)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The optimal modularity community assignment"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# shapes = \"Faction\" in the social network\n",
      "# colors = community assigned by the algorithm\n",
      "igraph.plot(karate, vertex_color = [color_dict[x] for x in karate.vs[\"karate_communities_opt\"]])\n",
      "# Or the PNG\n",
      "#Image(filename=\"karate_comms_3.png\", width=600)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The 2-community grouping"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# shapes = \"Faction\" in the social network\n",
      "# colors = community assigned by the algorithm \n",
      "igraph.plot(karate, vertex_color = [color_dict[x] for x in karate.vs[\"karate_communities_2\"]])\n",
      "#Image(filename=\"karate_comms_2.png\", width=600)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Just for fun, let's draw the UKfaculty graph too. I won't show the dendrogram though: too many nodes."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# This algorithm only takes undirected graphs, so we get to use some graph manipulation tools from earlier\n",
      "UKfaculty_undirected = UKfaculty.copy()\n",
      "UKfaculty_undirected.to_undirected(mode = \"collapse\", combine_edges = {'weight': sum})\n",
      "UKfaculty_dendrogram = UKfaculty_undirected.community_fastgreedy(weights = 'weight')\n",
      "# you can set the number of communities by setting n = \n",
      "# UKfaculty_communities = UKfaculty_dendrogram.as_clustering(n = 4)\n",
      "UKfaculty_communities = UKfaculty_dendrogram.as_clustering()\n",
      "i = 0\n",
      "for community in UKfaculty_communities:\n",
      "    i += 1\n",
      "    for v in community:\n",
      "        UKfaculty.vs[v][\"UKfaculty_communities\"] = i\n",
      "igraph.plot(UKfaculty, vertex_color = [color_dict[x] for x in UKfaculty.vs[\"UKfaculty_communities\"]])\n",
      "# or PNG\n",
      "#Image(filename=\"UKfaculty_comms.png\", width=600)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The real world is messy!"
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Practical Limitations of Modularity Maximization"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Reference: *Resolution Limit in Community Detection* by Fortunato and Barth\u00e9lemy"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ring = igraph.Graph.Ring(36)\n",
      "ring_adjmat = np.matrix([x for x in ring.get_adjacency()])\n",
      "igraph.plot(ring)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Make the dendrogram according the fast-greedy aglomerative algorithm\n",
      "ring_dendrogram = ring.community_fastgreedy()\n",
      "# Find the community partitions that maximize modularity\n",
      "ring_communities = ring_dendrogram.as_clustering()\n",
      "i = 0\n",
      "for community in ring_communities:\n",
      "    i += 1\n",
      "    for v in community:\n",
      "        ring.vs[v][\"community\"] = i\n",
      "# Show the dendrogram\n",
      "igraph.plot(ring_dendrogram)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "igraph.plot(ring, vertex_color = [color_dict[x] for x in ring.vs[\"community\"]])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The modularity-maximization community detection algortihm just found communities in a graph where every vertex is identical. What gives?\n",
      "\n",
      "It turns out that there is a natural \"resolution limit\" for modularity measurements--the concept of maximizing modularity can never resolve communities that are less than some characteristic fraction of the size of the entire network. Let's derive things.\n",
      "\n",
      "Focusing on an *undirected* network (because things are complicated enough already), recall modularity:\n",
      "$$Q = \\frac{1}{2m}\\sum_{ij}\\left(A_{ij} - \\frac{k_ik_j}{2m}\\right)\\delta(c_i, c_j)$$\n",
      "$$ = \\sum_{ij}\\left(\\frac{A_{ij}}{2m}\\right)\\delta(c_i, c_j) - \\sum_{ij}\\left(\\frac{k_ik_j}{(2m)^2}\\right)\\delta(c_i, c_j) $$\n",
      "Now say that there are $P$ total partitions, corresponding to $c = 1,2,3,...P$. The terms of the sum are *only* non-zero when $c_i = c_j$, so split the sum up by partition, $c$, each time considering only nodes in the partition $c$:\n",
      "$$ = \\sum_{c=1}^P \\left( \\sum_{ij, c_i = c_j = c}\\left(\\frac{A_{ij}}{2m}\\right) - \\sum_{ij, c_i = c_j = c}\\left(\\frac{k_ik_j}{(2m)^2}\\right) \\right) $$\n",
      "\n",
      "Then $\\sum_{ij, c_i = c_j = c}A_{ij}$ is simply equal to the total number of edges inside the partition where $c_i=c_j=c$, times 2 because we count $A_{ij}$ and $A_{ji}$ call the value $=2m_c$. \n",
      "\n",
      "$$ = \\sum_{c=1}^P \\left( \\frac{m_c}{m}- \\frac{1}{(2m)^2}\\sum_{ij, c_i = c_j = c}k_ik_j\\right) $$\n",
      "\n",
      "And some algebra can show that $\\sum_{ij, c_i = c_j = c}k_ik_j = \\left(\\sum_{ij, c_i = c}k_i\\right)^2$. Call $\\sum_{ij, c_i = c}k_i = K_c$, the total degree of all of the nodes in partition $c$.\n",
      "\n",
      "(If you don't believe me, write out the simple case: $i,j = 1,2$, then $\\sum_{ij}k_ik_j = k_1^2 + 2k_1k_2 + k_2^2 = (k_1 + k_2)^2 = \\left(\\sum_{ij}k_i\\right)^2 $)\n",
      "\n",
      "So:\n",
      "\n",
      "$$ Q = \\sum_{c=1}^P \\left( \\frac{m_c}{m}- \\left(\\frac{K_c}{2m}\\right)^2 \\right) $$\n",
      "\n",
      "Each term in the sum corresponds to a partition, and the term is positive (the partition makes a positive contribution to the total modularity of the graph) if $ \\frac{m_c}{m}- \\left(\\frac{K_c}{2m}\\right)^2 > 0 $.\n",
      "\n",
      "$K_c = = 2m_c + m_c^{out}$ (in summing the degrees of the nodes, each in-community edge gets counted twice), where again $m_c$ is the total number of edges inside the community and $m_c^{out}$ is the total number of edges connecting the community to the rest of the graph. \n",
      "\n",
      "$$ Q = \\sum_{c=1}^P \\left( \\frac{m_c}{m}- \\left(\\frac{2m_c + m_c^{out}}{2m}\\right)^2 \\right) $$\n",
      "\n",
      "Now consider the ring graph, or a more general ring graph where each node in the graph shown above is its own connected community. Assuming a fully connected graph, these communities each have the minimum number of out-connections, $m_c^{out} =2$, and with the minumal number of out-connections, should be maximally modular. For this graph: \n",
      "\n",
      "$$ Q = \\sum_{c=1}^P \\left( \\frac{m_c}{m}- \\left(\\frac{2m_c + 2}{2m}\\right)^2 \\right) = \\sum_{c=1}^P \\left( \\frac{m_c}{m}- \\left(\\frac{m_c + 1}{m}\\right)^2 \\right)$$\n",
      "\n",
      "And also on the ring graph, the total number of between-community edges is equal to the total number of partitions $P$, so $\\sum_{c=1}^P m_c = m - P$\n",
      "\n",
      "$$ Q = \\frac{m - P}{m} - \\sum_{c=1}^P \\left(\\frac{m_c + 1}{m}\\right)^2 $$\n",
      "\n",
      "Assuming a connected graph (all modules have at least 1 out-link) and given a constant total number of edges $m$, the above expression for $Q$ is maximized when each community has the same number of internal links $m_c = m/P - 1$ (a more rigorous demonstration of this is in the paper, but it should make intuitive sense). Then:\n",
      "\n",
      "$$ Q_{max} = \\frac{m - P}{m} - P\\left(\\frac{\\frac{m}{P} - 1+ 1}{m}\\right)^2 = 1 - \\frac{P}{m} - P\\left(\\frac{1}{P}\\right)^2 = 1 - \\frac{P}{m} - \\frac{1}{P}$$\n",
      "\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "P = range(1,400)\n",
      "m = 400\n",
      "Qmax = [1-(p/m)-(1/p) for p in P]\n",
      "plt.plot(P, Qmax)\n",
      "plt.title(\"Maximal modularity on a constructed ring graph with 400 nodes\")\n",
      "plt.xlabel(\"Number of modules\")\n",
      "plt.ylabel(\"Modularity Q\")\n",
      "ylim = plt.ylim([-1,1])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now find the maximum $Q$ of this maximally modular network in terms of the number of modules (number of partitions $P$). Find $\\frac{\\delta Q_{max}}{\\delta P} = -\\frac{1}{m} + \\frac{1}{P^2} = 0$ at $P = \\sqrt{m}$\n",
      "\n",
      "**The maximal number of partitions $P$ is $\\sqrt{m}$.**\n",
      "\n",
      "Then, assuming that the communities are connected and since we have $m$ edges in the network, only $P$ out-community edges, and the modules have the same number of edges, each community must have $m_c = \\sqrt{m} - 1$ in-community edges. Assuming that the communities are connected subgraphs, $m_c = \\sqrt{m} - 1$ in-edges can connect at most $\\sqrt{m}$ nodes.\n",
      "\n",
      "**The maximum number of nodes per maximal-modularity partition is $\\sqrt{m}$.**\n",
      "\n",
      "This means that modularity maximization cannot resolve communities where the number of communities in the graph is greater than the square root of the number of edges in the graph, $\\sqrt{m}$.\n",
      "\n",
      "Conclusion (from *Resolution Limit in Community Detection*):\n",
      "**Communities found using a modularity-maximization strategy with a size $m_c \\approx \\sqrt{2m}$ or smaller may be the result of an arbitrary merge of similar smaller structures.**"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Revisit the ring graph.\n",
      "- How many communities did we find?\n",
      "- Roughly what size are they?\n",
      "- There are 36 nodes. Try creating your own partitions and running modularity()"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Example:\n",
      "myPartition1 = [1,1,1,1,1,1,\n",
      "              2,2,2,2,2,2,\n",
      "              3,3,3,3,3,3,\n",
      "              4,4,4,4,4,4,\n",
      "              5,5,5,5,5,5,\n",
      "              6,6,6,6,6,6]\n",
      "print \"Modularity for 6 (sqrt(36)) evenly sized groups: {}\".format(modularity(ring_adjmat, myPartition1)) \n",
      "myPartition2 = [1 for x in range(0,36)]\n",
      "print \"Modularity for 1 large partition: {}\".format(modularity(ring_adjmat, myPartition2))\n",
      "myPartition3 = range(0,36)\n",
      "print \"Modularity for 36 partitions: {}\".format(modularity(ring_adjmat, myPartition3))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Anyway, that's about it. When modularity works, when it doesn't, and how to use it. Modularity and assortativity remain some of the most-used metrics in network science, especially when we are talking about social networks and community detection. Many, many papers use the output of a CNM (the fast-greedy aglomerative modularity maximization algorithm discussed here) as groud-truth when partitioning networks into communities. It is convenient, relatively fast (as fast as anything seraching a space of the size of the nth Bell number can be), and common enough that you can find an implementation somewhere (I chose igraph). It isn't necessarily wrong, but there are always caveats."
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "THE END"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}